Red wine is one of the most beautiful drinks, so it’s going to be interesting to find out what makes a good wine! :)
The data can be downloaded from this link; you can also find it on my GitHub along with the other report resources: link.
Also read this text file, which describes the variables and how the data was collected.
The data-set contains 11 chemical characteristics plus a quality score from 0 to 10, given by at least 3 wine experts, for 1599 different wines!
wine <- read.csv('./data/wineQualityReds.csv')
The data has 1599 observations of 13 variables: a row index (X), the 11 chemical characteristics and the quality score.
The type of data in each column is as follows:
str(wine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The units of each column are:
Input variables (based on physicochemical tests):
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)
Output variable (based on sensory data):
12. quality (score between 0 and 10)
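Let’s look closer at each variable on its own. The density plots show the distribution of each variable; the lines under each plot give the interval that holds the central 80% and 50% of the records, and the width of that interval as a percentage of the variable’s full range.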
fixed.acidity:
## 80% of the records lies in between 6.5 and 10.7 which is 37.2% of the graph.
## 50% of the records lies in between 7.1 and 9.2 which is 18.6% of the graph.
volatile.acidity:
## 80% of the records lies in between 0.31 and 0.745 which is 29.8% of the graph.
## 50% of the records lies in between 0.39 and 0.64 which is 17.1% of the graph.
citric.acid:
## 80% of the records lies in between 0.01 and 0.522 which is 51.2% of the graph.
## 50% of the records lies in between 0.09 and 0.42 which is 33% of the graph.
residual.sugar:
## 80% of the records lies in between 1.7 and 3.6 which is 13% of the graph.
## 50% of the records lies in between 1.9 and 2.6 which is 4.8% of the graph.
chlorides:
## 80% of the records lies in between 0.06 and 0.109 which is 8.2% of the graph.
## 50% of the records lies in between 0.07 and 0.09 which is 3.3% of the graph.
free.sulfur.dioxide:
## 80% of the records lies in between 5 and 31 which is 36.6% of the graph.
## 50% of the records lies in between 7 and 21 which is 19.7% of the graph.
total.sulfur.dioxide:
## 80% of the records lies in between 14 and 93.2 which is 28% of the graph.
## 50% of the records lies in between 22 and 62 which is 14.1% of the graph.
density:
## 80% of the records lies in between 0.994556 and 0.99914 which is 33.7% of the graph.
## 50% of the records lies in between 0.9956 and 0.997835 which is 16.4% of the graph.
pH:
## 80% of the records lies in between 3.12 and 3.51 which is 30.7% of the graph.
## 50% of the records lies in between 3.21 and 3.4 which is 15% of the graph.
sulphates:
## 80% of the records lies in between 0.5 and 0.85 which is 21% of the graph.
## 50% of the records lies in between 0.55 and 0.73 which is 10.8% of the graph.
alcohol:
## 80% of the records lies in between 9.3 and 12 which is 41.5% of the graph.
## 50% of the records lies in between 9.5 and 11.1 which is 24.6% of the graph.
quality:
## 80% of the records lies in between 5 and 7 which is 40% of the graph.
## 50% of the records lies in between 5 and 6 which is 20% of the graph.
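The code that printed these summaries is not shown in the report, so the following is only a sketch of how such lines could be computed. It assumes the interval is the central one (10th to 90th percentile for 80%, quartiles for 50%) and that "% of the graph" means the interval width divided by the variable’s full range. The function name `central_interval` and the toy vector are made up for illustration.

```r
# Central interval holding `coverage` of the values, plus the share of the
# variable's full range (max - min) that this interval spans.
# Assumption: this mirrors the report's summaries; the original code is not shown.
central_interval <- function(x, coverage = 0.8) {
  tail_prob <- (1 - coverage) / 2
  bounds <- quantile(x, probs = c(tail_prob, 1 - tail_prob), names = FALSE)
  share <- diff(bounds) / diff(range(x))
  list(lower = bounds[1], upper = bounds[2], share_of_range = share)
}

# Toy data (not the wine data): the integers 0..100.
toy <- 0:100
res80 <- central_interval(toy, coverage = 0.8)
res50 <- central_interval(toy, coverage = 0.5)

print(unlist(res80))
print(unlist(res50))
```

On the wine data the same call would be something like `central_interval(wine$alcohol, 0.8)`.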
Let’s focus on quality.
Although quality is supposed to range from 0 to 10, all records fall between 3 and 8. The count for each level is as follows:
table(wine$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
82.5% of the wines have a quality of either 5 or 6.
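The 82.5% figure can be checked directly from the counts printed by `table(wine$quality)`. The sketch below re-enters those counts by hand so it runs without the CSV file; the variable names are only illustrative.

```r
# Counts per quality level, copied from the table(wine$quality) output above.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(counts)                             # number of wines
share_5_6 <- sum(counts[c("5", "6")]) / total    # fraction rated 5 or 6

print(total)
print(round(100 * share_5_6, 1))
# Percentage of wines at every quality level
print(round(100 * prop.table(counts), 1))
```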
Let’s zoom in on the correlation between quality and the chemical characteristics:
| variable | Pearson corr |
|---|---|
| fixed.acidity | 0.12 |
| volatile.acidity | -0.39 |
| citric.acid | 0.23 |
| residual.sugar | 0.01 |
| chlorides | -0.13 |
| free.sulfur.dioxide | -0.05 |
| total.sulfur.dioxide | -0.19 |
| density | -0.17 |
| pH | -0.06 |
| sulphates | 0.25 |
| alcohol | 0.48 |
As we can see, the only relatively good correlation is with the alcohol percentage (0.48); volatile acidity comes next at -0.39.
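A table like this one can be produced by correlating every column with the target column. The sketch below shows one way to do it; `cor_with_target` is a made-up helper and the data frame is toy data, not the wine data. On the real data the index column X would presumably be dropped first.

```r
# Pearson correlation of every other column with the target column.
# Hypothetical helper; the report does not show how its table was built.
cor_with_target <- function(df, target) {
  predictors <- setdiff(names(df), target)
  sapply(predictors, function(v) cor(df[[v]], df[[target]], method = "pearson"))
}

# Toy data: `up` rises with y, `down` is its mirror image.
toy <- data.frame(
  up   = 1:10,
  down = 10:1,
  y    = c(1, 3, 2, 5, 4, 6, 8, 7, 9, 10)
)

res <- cor_with_target(toy, "y")
print(round(res, 2))
```

For the wine data the call would look like `cor_with_target(wine[, names(wine) != "X"], "quality")`.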
Another way to see the relations is by drawing boxplots.
The following graphs are boxplots of each chemical for each quality level [3-8].
fixed.acidity: The mean increases from level 4 to 7.
volatile.acidity: The mean decreases from level 3 to 7, and increases a little at 8.
citric.acid: The mean stays the same from 3 to 4, then increases up to 7, then stays the same at 8.
residual.sugar: The mean slightly decreases from 3 to 8.
chlorides: The mean significantly decreases from 3 to 4, then slowly decreases all the way to 8.
free.sulfur.dioxide: The mean increases from 3 to 5, then decreases from 5 to 8.
total.sulfur.dioxide: The same as free sulfur dioxide, the mean increases from 3 to 5, then decreases from 5 to 8.
density: The mean decreases from 3 to 4 and from 5 to 8, but increases from 4 to 5.
pH: The mean stays the same from 3 to 4 and from 5 to 6, and decreases otherwise.
sulphates: The mean slowly increases all the way.
alcohol: The mean significantly increases from 5 to 8 and from 3 to 4, but decreases from 4 to 5.
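The per-level means read off the boxplots can also be computed numerically, which makes the ups and downs easier to compare. This is a small sketch on toy data (the values are invented, not taken from the wine data); it groups one variable by quality and looks at the change from one level to the next.

```r
# Toy data (invented values): two wines at each of three quality levels.
toy <- data.frame(
  quality = c(3, 3, 4, 4, 5, 5),
  alcohol = c(10, 11, 9, 10, 12, 13)
)

# Mean of the variable within each quality level.
means <- sapply(split(toy$alcohol, toy$quality), mean)

# Change in the mean from each level to the next:
# negative = decrease, positive = increase.
steps <- diff(means)

print(means)
print(steps)
```

With the wine data, `split(wine$alcohol, wine$quality)` would give the groups behind the alcohol boxplot.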
So why are we doing this? Let’s remember what we are looking for: relations between quality and the chemical properties.
Correlations gave us the relation with alcohol only, but not with the others.
But in the boxplots we saw many increases and decreases across the quality levels, and we saw that the relation between quality and alcohol isn’t perfectly positive.
That leads us to the question in the next part:
which chemical characteristics influence the quality, or is there any relation between them at all?
Logic says yes, correlations say no except for alcohol, and the boxplots show some relations.
Let’s think in a different way: instead of searching for a direct relation between each characteristic and quality, let’s find the relations among the chemical characteristics themselves.
The correlation table below is a good way to find these relations.
The correlations are computed with both Pearson’s and Spearman’s methods, so each element in the table is structured as: Pearson’s / Spearman’s.
Correlations greater than 0.3 or less than -0.3 are colored in red.
| – | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed.acidity | 1 | ||||||||||
| volatile.acidity | -0.26 / -0.28 | 1 | |||||||||
| citric.acid | 0.67 / 0.66 | -0.55 / -0.61 | 1 | ||||||||
| residual.sugar | 0.11 / 0.22 | 0 / 0.03 | 0.14 / 0.18 | 1 | |||||||
| chlorides | 0.09 / 0.25 | 0.06 / 0.16 | 0.2 / 0.11 | 0.06 / 0.21 | 1 | ||||||
| free.sulfur.dioxide | -0.15 / -0.18 | -0.01 / 0.02 | -0.06 / -0.08 | 0.19 / 0.07 | 0.01 / 0 | 1 | |||||
| total.sulfur.dioxide | -0.11 / -0.09 | 0.08 / 0.09 | 0.04 / 0.01 | 0.2 / 0.15 | 0.05 / 0.13 | 0.67 / 0.79 | 1 | ||||
| density | 0.67 / 0.62 | 0.02 / 0.03 | 0.36 / 0.35 | 0.36 / 0.42 | 0.2 / 0.41 | -0.02 / -0.04 | 0.07 / 0.13 | 1 | |||
| pH | -0.68 / -0.71 | 0.23 / 0.23 | -0.54 / -0.55 | -0.09 / -0.09 | -0.27 / -0.23 | 0.07 / 0.12 | -0.07 / -0.01 | -0.34 / -0.31 | 1 | ||
| sulphates | 0.18 / 0.21 | -0.26 / -0.33 | 0.31 / 0.33 | 0.01 / 0.04 | 0.37 / 0.02 | 0.05 / 0.05 | 0.04 / 0 | 0.15 / 0.16 | -0.2 / -0.08 | 1 | |
| alcohol | -0.06 / -0.07 | -0.2 / -0.22 | 0.11 / 0.1 | 0.04 / 0.12 | -0.22 / -0.28 | -0.07 / -0.08 | -0.21 / -0.26 | -0.5 / -0.46 | 0.21 / 0.18 | 0.09 / 0.21 | 1 |
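A lower-triangular "Pearson / Spearman" table like the one above can be assembled from two calls to `cor()`. The helper name `cor_pair_table` and the toy data frame are assumptions for illustration; the report does not show the code it used.

```r
# Build a lower-triangular table whose cells read "pearson / spearman".
# Hypothetical helper, shown only to illustrate the layout of the table above.
cor_pair_table <- function(df) {
  p <- round(cor(df, method = "pearson"), 2)
  s <- round(cor(df, method = "spearman"), 2)
  out <- matrix(paste(p, s, sep = " / "), nrow = nrow(p), dimnames = dimnames(p))
  out[upper.tri(out)] <- ""   # keep only the lower triangle
  diag(out) <- "1"            # a variable correlates perfectly with itself
  out
}

# Toy data: y is a monotone but non-linear function of x, so Spearman's
# coefficient is exactly 1 while Pearson's is slightly lower.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(1, 4, 9, 16, 25),
  z = c(5, 3, 4, 1, 2)
)

tab <- cor_pair_table(toy)
print(tab, quote = FALSE)
```

The toy example also shows why the two numbers can differ: Spearman’s method works on ranks, so it rewards any monotone relation, while Pearson’s measures how close the relation is to a straight line.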
fixed acidity is correlated to citric acid, density and pH.
volatile acidity is correlated to citric acid and sulphates.
citric acid is correlated to volatile, fixed acidity, pH and sulphates.
chlorides is correlated to density and sulphates.
density is correlated to fixed acidity, alcohol, residual sugar and chlorides.
pH is correlated to fixed acidity and citric acid.
sulphates is correlated to volatile acidity, citric acid and chlorides.
residual sugar is correlated to density.
alcohol is correlated to density.
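The list above can be read off the table by hand, but it can also be extracted programmatically. The sketch below keeps every pair whose Pearson or Spearman coefficient exceeds 0.3 in absolute value. Treating a pair as correlated when either coefficient passes the threshold is an assumption, and `strong_pairs` and the toy data are invented for illustration.

```r
# List variable pairs whose |Pearson| or |Spearman| correlation exceeds
# the threshold. Hypothetical helper; each pair is reported once.
strong_pairs <- function(df, threshold = 0.3) {
  p <- cor(df, method = "pearson")
  s <- cor(df, method = "spearman")
  keep <- (abs(p) > threshold | abs(s) > threshold) & lower.tri(p)
  idx <- which(keep, arr.ind = TRUE)
  data.frame(
    var1     = rownames(p)[idx[, "row"]],
    var2     = colnames(p)[idx[, "col"]],
    pearson  = round(p[idx], 2),
    spearman = round(s[idx], 2)
  )
}

# Toy data: y follows x closely, z is unrelated to both.
toy <- data.frame(
  x = 1:6,
  y = c(2, 4, 5, 8, 11, 12),
  z = c(1, -1, 0, 0, -1, 1)
)

pairs_found <- strong_pairs(toy)
print(pairs_found)
```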
So we have 7 parent nodes which have children:
Quality, Alcohol, Density, Fixed Acidity, Chlorides, Citric acid and Volatile acidity.
All of them depend on each other: as we know, alcohol affects quality, alcohol is affected by density, which is affected by other chemicals, which are affected in turn, and so on.
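Counting negative and positive correlations, quality increases when the following happen:
[Figure: relation tree. Quality is at the top and depends on Alcohol, which depends on Density. Density branches into Fixed acidity, Residual sugar and Chlorides; Fixed acidity branches into Citric acid and pH; Chlorides into Sulphates; Citric acid into Volatile acidity, pH and Sulphates.]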
Let’s go back to our question: WHAT CHEMICAL PROPERTIES INFLUENCE THE QUALITY?
To answer that we must go through the earlier tree from the bottom to the top.
The plots below explain that: the first plot has Quality as Y (dependent), then the next variable in the tree becomes the Y of the next plot, and so on.
Let’s start with the top element [Quality].
Quality is positively correlated with alcohol, though there are a few drop-off points above and below the fitted line. Let’s look at alcohol.
The mean and quantile lines go up and down, but there is still a relation. Let’s have a look at density.
Density depends on three variables (fixed acidity, chlorides and residual sugar), and its line of means matches the line of best fit.
As shown, fixed acidity has a positive relation with citric acid and a negative one with pH.
Also, citric acid has a positive relation with sulphates and a negative relation with both pH and volatile acidity.
Now that we have shown the relations between quality and the chemical properties, let’s build a regression model, so that in future, given the chemical properties of a wine, we can predict its quality.
Simple linear regression uses one independent variable to predict the outcome of a dependent variable; here we use several variables and their interactions, so the model is a multiple linear regression.
We will use the formula Y ~ X, where X represents the relations shown in the tree above.
Because the variables are on different scales, it is nicer to bring all of them to the same scale. I’ll choose the scale from 0 to 10, so every element of each variable will have a value from 0 to 10. This changes only the units: the shape of each distribution and the correlations between variables stay the same.
The rescaled data is stored in a new variable called ‘wine.ratio’.
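The report does not show how ‘wine.ratio’ was built, so the following is only a sketch of one common way to do it: min-max scaling of each column to the range 0 to 10. The function name `rescale_0_10` and the toy data are assumptions, and it is not stated whether the quality column itself was rescaled.

```r
# Min-max rescaling to the range [0, 10].
# Assumption: this is the kind of scaling the text describes; the original
# code for wine.ratio is not shown.
rescale_0_10 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * 10
}

# Toy data with two columns on very different scales.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 12.0),
  pH      = c(3.51, 3.20, 3.26, 3.16)
)

# Apply the rescaling column by column.
toy.ratio <- as.data.frame(lapply(toy, rescale_0_10))

print(toy.ratio)
# The correlation between the columns is unchanged by the rescaling.
print(cor(toy$alcohol, toy$pH))
print(cor(toy.ratio$alcohol, toy.ratio$pH))
```

Because each column is only shifted and multiplied by a positive constant, correlations and the fit quality of a linear model are unaffected; only the size of the slopes changes.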
Now let’s look at the model:
# In an R formula, a * b expands to a + b + a:b
# (both main effects plus their interaction).
reg_lm <- lm(
  quality ~
    alcohol * density +
    density * fixed.acidity +
    density * residual.sugar +
    density * chlorides +
    chlorides * sulphates +
    fixed.acidity * pH +
    fixed.acidity * citric.acid +
    citric.acid * pH +
    citric.acid * volatile.acidity +
    citric.acid * sulphates +
    volatile.acidity * sulphates,
  data = wine.ratio
)
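To see which terms a formula with `*` really contains, the formula can be expanded with `terms()`. The small example below uses placeholder variable names (`y`, `a`, `b`, `c`) rather than the wine columns; it shows that repeated main effects are only counted once.

```r
# Placeholder formula: b appears in two products but yields one main effect.
f <- y ~ a * b + b * c

# Expand the formula into its individual terms.
term_labels <- attr(terms(f), "term.labels")
print(term_labels)
```

Applied to the model above, the same expansion gives 9 main effects and 11 interaction terms, which is why the slope table has 20 rows.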
Slopes :
| variable | slope |
|---|---|
| alcohol | 0.205*** |
| density | -0.012 |
| fixed.acidity | -0.013 |
| residual.sugar | -0.081 |
| chlorides | 0.133 |
| sulphates | 0.217** |
| pH | -0.028 |
| citric.acid | 0.110 |
| volatile.acidity | -0.202*** |
| alcohol x density | -0.003 |
| density x fixed.acidity | -0.006 |
| density x residual.sugar | 0.015 |
| density x chlorides | -0.020 |
| chlorides x sulphates | -0.036* |
| fixed.acidity x pH | 0.031* |
| fixed.acidity x citric.acid | -0.006 |
| pH x citric.acid | -0.036** |
| citric.acid x volatile.acidity | 0.015 |
| sulphates x citric.acid | -0.001 |
| sulphates x volatile.acidity | -0.004 |
Intercept and some statistics :
| type | value |
|---|---|
| Intercept | 5.230*** |
| R-squared | 0.362 |
| adj. R-squared | 0.354 |
| sigma | 0.649 |
| F | 44.847 |
| p | 0.000 |
| Log-likelihood | -1566.812 |
| Deviance | 664.474 |
| AIC | 3177.623 |
| BIC | 3295.920 |
| N | 1599 |
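Several of these statistics are tied together by simple formulas, so they can be cross-checked from the table itself. The sketch below re-enters the reported values by hand; it assumes 20 slope terms (the rows of the slope table) and the usual definitions used by `lm()`, where the parameter count for AIC and BIC is the slopes plus the intercept plus the error variance.

```r
# Values copied from the tables above.
n      <- 1599        # observations
p      <- 20          # slope terms (rows of the slope table)
r2     <- 0.362       # R-squared
rss    <- 664.474     # deviance = residual sum of squares for lm
loglik <- -1566.812   # log-likelihood

# Adjusted R-squared penalizes R-squared for the number of slopes.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Residual standard error (sigma).
sigma <- sqrt(rss / (n - p - 1))

# Information criteria; k counts slopes + intercept + error variance.
k   <- p + 2
aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + k * log(n)

print(round(c(adj_r2 = adj_r2, sigma = sigma), 3))
print(round(c(AIC = aic, BIC = bic), 3))
```

The results agree with the reported values up to rounding. An R-squared of 0.362 means the model explains a bit over a third of the variation in quality.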
Let’s visualize our model.
The first graph shows boxplots for the formula Y ~ X, with Y (wine quality) as a factor on the x-axis and X (the relations between the chemical properties shown above) on the y-axis.
I’ll use the new data-set, wine.ratio, here.
As shown above, the mean of X gets higher as quality gets higher for qualities 3, 5, 6 and 7. The exceptions are 4 and 8: the mean of X at quality 4 is lower than the mean at 3, and the mean at quality 8 is lower than the mean at quality 7.
But we can still say that as quality increases, X increases.
The second graph shows the difference between the actual quality and the quality predicted by the regression model. Let’s first make a new column in the data called “quality.predicted”, which will hold the predictions of the regression model.
wine$quality.predicted <- round( predict(reg_lm, wine.ratio ) )
Now let’s plot it:
The bars show the number of wines having each quality.
The red bars are the actual quality, and the blue ones are the predicted quality.
Most of the predicted qualities are 5 and 6, with a few 7s.
The model couldn’t predict qualities 3, 4 and 8; instead it predicted 5 and 6 more often than they actually occur.
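The comparison behind this plot boils down to counting actual and rounded predicted values per quality level. The sketch below uses invented toy vectors instead of the model output, so the numbers are illustrative only; it also shows why a regression tends to miss the extreme levels, since its predictions cluster around the middle before rounding.

```r
quality_levels <- 3:8

# Toy data (invented): actual scores and raw model predictions.
actual        <- c(3, 4, 5, 5, 6, 6, 7, 8)
predicted_raw <- c(5.2, 5.1, 5.3, 4.9, 5.8, 6.2, 6.4, 6.6)
predicted     <- round(predicted_raw)

# Count wines per level; factor() keeps levels that never occur as zeros.
counts <- rbind(
  actual    = table(factor(actual, levels = quality_levels)),
  predicted = table(factor(predicted, levels = quality_levels))
)

# Share of wines whose rounded prediction matches the actual score.
accuracy <- mean(actual == predicted)

print(counts)
print(accuracy)
# A side-by-side bar chart like the one described could be drawn with:
# barplot(counts, beside = TRUE, col = c("red", "blue"))
```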
We started by wondering about the relation between the quality of wine and its chemical properties. It’s clear that there must be a relation, but although we found one, it is still weak and we can’t count on it.
So how does this make sense? If we trust that the chemical tests are correct and there is no error in the data, then the error is in the human factor! Let’s not forget that the quality is entered by humans, and humans always make mistakes!
So I believe, to some degree, that many of the quality values reflect personal preference rather than actual high quality.
I chose three plots to summarize the analysis we did:
The first one is the boxplot graph of alcohol against quality, which shows how quality is affected by the alcohol percentage.
The highest quality level has a mean near 12% alcohol, and the lowest quality level has a mean near 10% alcohol.
Even if a 2% difference doesn’t seem big, the graph shows otherwise: as the alcohol level goes from 10 to 12, the quality level goes higher.
The second one is the graph which shows the difference between the actual quality and the quality predicted by the regression model.
The model predicts qualities 5 and 6 far more often than the other levels.
The third graph is the relation between X and Y in the linear regression model.
What is interesting is that it shows the relation between the quality and all the chemical properties in one understandable graph. It also shows that quality increases with X, although the effect is small.
In the analysis we just did, I could find relations between quality and the chemical properties. Some of them aren’t easy to find, as shown in the boxplot graphs, and I couldn’t be sure that the relations are real and not just a coincidence.
I struggled to find the best-fitting formula for the linear model. I tried my best and I hope the one I chose is good enough.
So how do we really get at the relation between a wine’s quality and its chemical properties?
I don’t believe that diving deeper into this data-set would give me the answer. To get the answer we have to find a better data-set for it; maybe that data would contain prices, brands, and more accurate quality scores or drinkers’ reviews.
Also, chemical properties aren’t everything that matters in wine. There is still the type of grape used, the quality of the wine brand, any flavors added and other things that haven’t been considered in the data-set.
Another thing: the fact that most of the quality values are 5 or 6 makes it harder to analyze the data. There are no very good wines (of quality 9 or 10) or very bad wines (of quality 0, 1 or 2), which also confirms that the data isn’t strong enough to rely on and, as I said, humans make mistakes.
The data-set used in this report :
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.